Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combine two data sets | Variable names are different


    Dear Stata listers, I have to merge two data sets. Both have a unique identifier. Unfortunately, most (if not all) of the other variables have different names. For example, in one I would have pdead (participant dead) and in the other it is palive (participant alive). Five years ago somebody posted a similar question: http://www.stata.com/statalist/archi.../msg00394.html
    I could check the variables one by one. This will be time consuming and tedious. If there is any way to merge these two data sets easier? Thanks Best

  • #2
    Rodrigo,

    Unfortunately the answer (or non-answer) now is the same as it was then. Stata cannot read minds so there is no way for it to know what variables are the same (or similar) in two different data sets unless you give the variables the same name. What did you have in mind?

    Regards,
    Joe

    Comment


    • #3
      Hi Rodrigo and Joe,

      Why can't Stata search through the datasets that are to be merged and list unique observations of all variables and recommend "merge by" or "join by" variables based on those variables that have the same unique identifiers? This would likely take a long time and, with large datasets, not be preferable to manually going through and renaming "merge by" variables to match across datasets. Alternatively, is there a way to tell Stata when merging that you want the variable ID1 in dataset #1 to be matched with the variable ID2 in dataset #2?--So, ID1 and ID2 are the same unique identifiers just different variable names.

      Gio

      Comment


      • #4
        Here, an excerpt from the Stata Manual concerning - merge - command:

        varlist specifies variables common to
        both datasets that together uniquely identify single observations in both datasets
        . For instance, suppose
        you have a dataset of customer information, called customer.dta, and have a second dataset of other
        information about roughly the same customers, called other.dta. Suppose further that both datasets
        identify individuals by using the pid variable, and there is only one observation per individual in
        each dataset. You would merge the two datasets by typing
        . use customer
        . merge 1:1 pid using other
        That said, if you are sure every row of one dataset corresponds to the other, you may try this command:

        Code:
        . */ One-to-one merge by observation
        
                merge 1:1 _n using filename
        Last edited by Marcos Almeida; 29 Aug 2018, 08:45.
        Best regards,

        Marcos

        Comment


        • #5
          I think there may be a solution to this problem, provided I'm not misinterpreting the issue.

          If the OP's problem is that the unique identifier used to merge the two data sets is named differently in the using data set, then I think one can use a combination of -cfvars- to obtain all variable names from the using data set and -mmerge- to loop merge attempts between the two data sets by all variabels in the using data set.

          I'm imagining something like:

          Code:
          // Create using dataset
          sysuse auto , clear
          egen idfalse = seq()
          egen idtrue = group(make price)
          keep in 1/5
          gen newvariable = runiform(0,1)
          drop make-foreign
          save auto2 , replace
          
          // Get list of variable names in using data set
          // Note: I don't think this is the best way to do this
          clear
          gen notarealvariable = 0
          cfvars auto2.dta
          
          // Loop through using data set variable names and attemp merge
          // If not all observations from using dataset are matched,
          // then go to next varaible in the using dataset
          foreach x in `r(oneonly)' {
              sysuse auto , clear
              egen id = group(make price)
              capture mmerge id using auto2.dta , umatch(`x')
              capture local uobs = `r(uobs)'
              capture count if _merge == 3
              capture local match = `r(N)'
              if `uobs' != . {
                  assert `match' != `uobs'    
              }
          }
          If a complete match is found, execution will cease and -assert- will display "assertion is false".

          So this seems like a very bad idea to begin with, but if the using data set has hundreds or thousands of variables--only one of which being a unique identifier that matches the master data set--then I can imagine how it could save some time. It's important to note that this assumes you know the unique identifier in the master data set and that you are only interested in a complete match between the master and using data set (i.e. all observations from using are matched to master). Also note that -capture- is added because if the merge-by variable type does not match, -mmerge- will throw an error. Finally, if in the using data set there are variables which are not unique identifiers but all values match values of the unique id in the master data set then this method fails.

          Both -mmerge- and -cfvars- are available on SSC. As a side note, for those just wanting to merge two data sets with ids that are named differently--without going into the using data set or renaming the ids in the master data set--I've found -mmerge- to be very useful.
          Last edited by Giovanni Colitti; 07 Dec 2018, 10:42.

          Comment

          Working...
          X